Jess Goddard, Rich Pauloo, Ryan Shepherd, Noor Brody

Last updated 2022-03-27 22:23:14


Introduction

In this report, we present a nationwide water service spatial boundary layer for community water systems and explain the tiered approach taken towards this end. This report builds on a previous technical memorandum (eda_february.html) that should be read as prerequisite material to understand the broader context, background, and exploratory data analysis (EDA) that informs the approach taken herein.

The resulting national water service boundary layer is the product of a “Tiered Explicit, Match, and Model” (henceforth, TEMM) approach. The TEMM is composed of three hierarchical tiers, arranged by data and model fidelity.

  1. Whenever present, we use explicit water service boundaries (labeled data). These are spatial polygon data, typically provided at the state-level.
  2. In the absence of explicit water service boundary data, we use a matching algorithm to match PWSIDs to TIGER place polygons by spatial intersection and name. We validate this heuristic approach on explicit water service boundaries1, and find favorable results consistent with independent validation of the approach by Patterson et al. (2019).
  3. Finally, in the absence of an explicit water service boundary (Tier 1) or a TIGER place polygon match (Tier 2), a statistical model trained on explicit water service boundary data (Tier 1) is used to estimate a reasonable radius at provided PWSID centroids, notwithstanding centroid accuracy issues that remain intractable at this point in time (centroid accuracy is discussed in the Limitations section with recommendations for future work).

Below is a conceptual diagram of the three tiers in the TEMM approach. Tier 1 explicit boundaries always supersede Tier 2 matched proxy boundaries, which in turn always supersede Tier 3 modeled boundaries. Thus, the resulting water service boundary layer described in this report is a combination of all three tiers, and depends on data availability for each water system and on whether or not it matches a TIGER place.

In the sections that follow, we summarize our approach for each of the three tiers. As we move from Tier 1 to Tier 3, uncertainty increases; hence, we describe each Tier in increasing detail and provide measures of validation and uncertainty for Tier 2 and Tier 3 estimates.

Finally, we encourage the reader to consider the TEMM water service boundary layer not as a final product, but rather, one that may be improved with the assimilation of additional Tier 1 explicit data, improvements to the Tier 2 matching algorithm, and refinement of the Tier 3 model. Ultimately, this entire workflow may be superseded by nationwide Tier 1 data, which would reduce the problem we address in the scope of work to one of simple Tier 1 data assimilation and cleaning.


Report outline

library(tidyverse)
library(sf)
library(fs)
library(here)
library(knitr)
library(rcartocolor)
library(geofacet)
library(mapview)

# don't allow fgb streaming: delivers self-contained html
mapviewOptions(fgb = FALSE)

# load environmental variable for staging path 
staging_path <- Sys.getenv("WSB_STAGING_PATH")

# read TEMM spatial output for key result summary stats below
temm <- here("temm_layer/2022-03-27_temm.geojson") %>% 
  st_read(quiet = TRUE) %>% 
  # drop this column because it's read as a list and causes mapview trouble
  select(-service_area_type_code)

# Tier 1 labeled data
wsb_labeled_clean <- path(staging_path, "wsb_labeled_clean.geojson") %>% 
  st_read(quiet = TRUE)

# service connection count cutoff for community water systems
n_max <- 15

# read the clean matched output and perform minor transforms for plots.
# this has the same number of rows as `temm`, but contains Tier 1 "radius"
d <- read_csv(path(staging_path, "matched_output_clean.csv")) %>%
  mutate(
    # transform radius from m to km
    radius = radius/1000,
    # indicate the tier to use for each pwsid
    tier = case_when(
      has_labeled_bound == TRUE ~ "Tier 1",
      has_labeled_bound == FALSE & !is.na(tiger_match_geoid) ~ "Tier 2",
      has_labeled_bound == FALSE & is.na(tiger_match_geoid)  ~ "Tier 3"
    )
  ) %>% 
  # filter to CWS and assume each connection must serve at least 1 person
  # this drops 267 rows (0.5% of data)
  filter(service_connections_count >= n_max,
         population_served_count   >= n_max) %>% 
  # remove 834 rows (1.5% of data) not in contiguous US, mostly Puerto Rico
  filter(primacy_agency_code %in% state.abb) 

# population served by all water systems
pop_total <- sum(d$population_served_count)

# calculate count and proportion of people served by each tier
pop <- d %>% 
  group_by(tier) %>% 
  summarize(
    count = format_bign(sum(population_served_count)),
    prop  = round((sum(population_served_count)/pop_total)*100, 2)
  )

The following outline reflects a summary of key findings, followed by a description of each Tier in the TEMM approach. Notably, the TEMM may be refined and re-run as new data sources are ingested or improvements are made to matching and modeling algorithms.

Key Results

  • A nationwide water service boundary layer is provided for 45,564 community water systems covering a total population of 309,531,362.
  • Population coverage rates per Tier:
    • Tier 1: 39.53% population covered (122,360,150 people)
    • Tier 2: 52.66% population covered (163,009,796 people)
    • Tier 3: 7.81% population covered (24,161,416 people)
  • Unequal Tier distribution across states suggests that the confidence in water service boundary location and extent varies by state and system.
  • Favorable uncertainty metrics for Tier 2 and 3 estimates suggest that the TEMM approach provides a reasonable preliminary water service boundary layer.

Tier 1: Explicit boundaries

  • Data discovered for 12 states (AZ, CA, CT, KS, MO, NC, NJ, NM, OK, PA, TX, WA).
  • These data are considered “high quality” and used to model Tier 3 boundaries.

Tier 2: Matched TIGER proxy boundaries

  • 65.79% of water systems (29,975 / 45,564) match a TIGER place by spatial intersection or name match.
  • Spatial and name matching is constrained within the reported state to prevent out-of-state mismatches.
  • When a system has multiple spatial or name matches, minimum spatial distance between the reported PWSID centroid and the matched TIGER Place centroid is used to break ties and assign a single proxy boundary.
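The tie-breaking rule can be sketched as follows; `break_ties`, the input table, and its column names are hypothetical stand-ins for the project's actual matching code, shown only to illustrate the minimum-centroid-distance rule:

```r
library(sf)
library(dplyr)

# Hypothetical sketch of the tie-breaking rule: among several candidate
# TIGER places matched to one PWSID, keep the one whose centroid lies
# closest to the reported PWSID centroid. `candidates` is assumed to
# hold one row per (pwsid, tiger_match_geoid) pair with both centroids.
break_ties <- function(candidates) {
  candidates %>%
    mutate(
      dist_m = as.numeric(
        st_distance(pwsid_centroid, tiger_centroid, by_element = TRUE)
      )
    ) %>%
    group_by(pwsid) %>%
    slice_min(dist_m, n = 1, with_ties = FALSE) %>%
    ungroup()
}
```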

Tier 3: Modeled boundaries

  • Random Forest, xgboost, and multiple linear regression models were fit to Tier 1 labeled boundaries to develop a model that estimates water system boundaries at all systems nationwide.
  • Circular water service boundary estimates were then calculated for all water systems; the median and the 5% and 95% interval estimates reflect medium, low, and high water service boundary estimates.
  • The multiple linear regression fit has an impressive \(R^2 = 0.67\) and is mostly explained by service connection count, which makes intuitive sense: the spatial extent of a water service area scales roughly linearly with service connection count. EDA, random forest, and xgboost models confirm the importance of service connection count as an explanatory variable.

Contributions

  • The significance and broader impact of the TEMM open source, nationwide water service boundary layer is briefly discussed.
  • The reader is oriented towards contributor guides and a vision for the continual development of the project is outlined.

Recommendations

  • Directions for future work to address the limitations of the TEMM approach are briefly discussed, in particular:
    • Centroid accuracy and the “pancake problem” may be solved by “re-centering” low-accuracy centroids with city/town spatial databases.
    • Some primacy agency codes and state codes report conflicting information: for example, a water system may be spatially located (via spatial coordinates) in state A, but have primacy agency or state codes for state B. This may be resolved by


Key Results

The key result of this study is a nationwide water service boundary layer. Here we show the proportion of population served by each TEMM Tier at nationwide and statewide scales.


Tier coverage - nationwide

In total, the TEMM data layer represents tap water delivery to 309.53 million people served by 45,564 water systems2.

About 122 million people are covered by Tier 1 spatial data – impressive given that only 12 states provided explicit boundary data. However, this relatively high coverage rate is unsurprising because these states (AZ, CA, CT, KS, MO, NC, NJ, NM, OK, PA, TX, WA) include high-population states like CA, TX, and PA.

Together, around 285 million people (92.19% of the population) are covered by either a Tier 1 or a Tier 2 spatial boundary. The remaining approximately 24 million people (7.81%) are covered by a Tier 3 boundary. These results indicate high confidence in the spatial accuracy of the resulting TEMM water boundary layer for community water systems.

pop %>% 
  kable(col.names = c("Tier", "Population count", 
                      "Population proportion (%)")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Tier Population count Population proportion (%)
Tier 1 122,360,150 39.53
Tier 2 163,009,796 52.66
Tier 3 24,161,416 7.81


Tier coverage - statewide

Next, we show the proportion of population covered per TEMM Tier on a state-by-state basis. Notably:

  • When Tier 1 data is present (orange bars below), it tends to cover a majority of the population.
  • Tier 2 (blue bars below) covers more population than Tier 3 across all states.
  • Tier 3 coverage (grey bars below) is not uniform across states. This implies that the Tier 2 matching algorithm may perform better in some states, and may depend on factors that vary across states, like centroid accuracy and water system type (e.g., municipal).
# dataframe for barplot geofacet: population prop served by each tier
dg <- d %>% 
  group_by(state_code) %>%
  # population per state
  mutate(state_pop = sum(population_served_count)) %>%
  ungroup() %>% 
  group_by(state_code, tier) %>% 
  # calculate count/proportion of population served per state and tier
  summarise(
    count = sum(population_served_count),
    prop  = count/state_pop
  ) %>% 
  ungroup() %>% 
  distinct() %>% 
  # add missing NA values per state code and tier group
  complete(state_code, tier) %>% 
  mutate(tier = factor(tier, levels = paste("Tier", 3:1)))

# sanity check the grouped summary above: this should return all 1
# group_by(dg, state_code) %>% 
#   summarise(s = sum(prop, na.rm = TRUE)) %>% 
#   pull(s) 

# geofacet of TEMM tier coverage per state in terms of population proportion
dg %>% 
  ggplot(aes(tier, prop, fill = tier)) + 
  geom_col() + 
  coord_flip() +
  scale_fill_carto_d(direction = -1) +
  scale_y_continuous(breaks = c(0, 1, 0.5), 
                     labels = c(0, 1, 0.5)) +
  geofacet::facet_geo(~state_code) +
  labs(x = "", y = "Proportion of Population Covered") +
  guides(fill = "none") +
  theme_minimal(base_size = 6) +
  theme(panel.grid.minor.x = element_blank())


A data table of the above plot is provided below.

dg %>% 
  mutate(
    count = ifelse(is.na(count), 0, count),
    count = format_bign(count),
    prop  = ifelse(is.na(prop), 0, prop),
    prop  = round(prop, 3)
  ) %>% 
  dt_make()


National boundary layer

The national water service boundary layer is too large (around 400 MB) to plot in this interactive report, so we show a static map below to illustrate the coverage provided by Tiers 1-3 in the proportions described above, and direct the reader to the TEMM spatial layer in the project repository.

include_graphics(here("src/analysis/sandbox/model_explore/etc/temm-nation.png"))

Here, we zoom into California, Nevada, and Oregon – three states with high proportions of Tier 1, 2, and 3 spatial boundaries respectively. The Tier 3 circular buffers in this interactive map represent median (50th percentile) model estimates. Note that when clicking on polygons in the map below, only Tier 2 and 3 data

# plot CA, NV, OR
temm %>% 
  filter(primacy_agency_code %in% c("CA", "NV", "OR")) %>% 
  select(pwsid, service_connections_count, tier) %>% 
  mapview::mapview(zcol = "tier", burst = TRUE)

Next, we review the construction of each Tier.


Tier 1: Explicit boundaries

Tier 1 boundaries are the most straightforward to understand: states provide explicit data, and when they do, these boundaries tend to cover nearly the entire population served by water systems. Tier 1 data missingness challenges the accuracy of downstream Tier 2 and 3 boundary estimates. At the time of writing, Tier 1 data is present for 12 states (AZ, CA, CT, KS, MO, NC, NJ, NM, OK, PA, TX, WA).

# summarize missing and present states in a dataframe
states_missing <- state.abb[!state.abb %in% unique(wsb_labeled_clean$state)]
states_present <- unique(wsb_labeled_clean$state)

states_df <- tibble(
  state_abbr = c(states_missing, states_present),
  status = c(rep("missing", length(states_missing)), 
             rep("present", length(states_present)))
)

# pull usa state geometries, project to input data CRS
usa <- USAboundaries::us_states(resolution = "low") %>% 
  st_transform(epsg) %>% 
  left_join(states_df) %>% 
  tigris::shift_geometry() %>% 
  # remove Puerto Rico
  filter(!is.na(status))

# map of missing states
ggplot() +
  geom_sf(data = usa, aes(fill = status), alpha = 0.5, color = "white") +
  scale_fill_carto_d(palette = "Pastel") +
  labs(fill = "") +
  theme_void() + 
  labs(title = "Labeled data coverage by state",
       subtitle = paste("Last updated:", Sys.Date()))

We assume that data provided by states is correct, and do not clean Tier 1 data beyond assimilating the data across states into a single layer.


Tier 2: Matched boundaries

Ryan’s section with conceptual model.


Uncertainty

We estimate the uncertainty in the Tier 2 matching approach with a Modified Jaccard similarity, defined as the proportion of the underlying water service boundary area overlapped by its TIGER match. We calculate the Modified Jaccard index \(J\) by comparing Tier 2 matched TIGER geometries to explicit Tier 1 geometries.

\[ J = \frac{A_T \cap A_E }{A_E} \] where \(A_T \cap A_E\) is the area of intersection of the TIGER Place area (\(A_T\)) and the explicit Tier 1 water service boundary area (\(A_E\)). Thus, \(J\) lies in the closed interval from 0 to 1, inclusive: \(\{J \space \space | \space \space 0 \le J \le 1\}\). Importantly, \(A_T\) can fall outside of \(A_E\); \(J\) measures only the proportion of coverage that \(A_T\) provides over \(A_E\).
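As a minimal sketch (the function name is ours, not from the project code), \(J\) can be computed for any pair of polygons with `sf`:

```r
library(sf)

# Modified Jaccard similarity: the fraction of the explicit boundary
# a_e covered by the TIGER geometry a_t (0 = disjoint, 1 = full
# coverage); both inputs are sfc polygons in the same projected CRS
modified_jaccard <- function(a_t, a_e) {
  overlap <- st_intersection(a_t, a_e)
  if (length(overlap) == 0) return(0)  # non-intersecting: J = 0
  as.numeric(sum(st_area(overlap)) / st_area(a_e))
}
```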

# Tier 2 TIGER geometries for explicit boundaries
tiger <- path(staging_path, "tigris_places_clean.geojson") %>% 
  st_read(quiet = TRUE) %>% 
  select(tiger_match_geoid = geoid)

t2 <- temm %>% 
  filter(tier == "Tier 1" & !is.na(tiger_match_geoid)) %>% 
  st_drop_geometry() %>% 
  select(tiger_match_geoid, pwsid) %>% 
  left_join(tiger) %>% 
  st_as_sf() %>% 
  arrange(pwsid) %>% 
  # put into metric CRS, ensure valid geom for intersection
  st_transform(3310) %>% 
  st_make_valid() 

# Tier 1: explicit boundaries
t1 <- temm %>% 
  filter(
    tier == "Tier 1" & 
    pwsid %in% t2$pwsid
  ) %>% 
  select(pwsid) %>% 
  arrange(pwsid) %>% 
  # put into metric CRS, ensure valid geom for intersection, calc area
  st_transform(3310) %>% 
  st_make_valid() %>% 
  mutate(area_t1 = st_area(geometry)) 

# sanity check: t1 and t2 vectors are identical and ready to be compared
# identical(t1$pwsid, t2$pwsid)

# calculate intersection of t2 over t1 and calculate area
f_intersection <- function(x){
  res = st_intersection(t2[x, ], t1[x, ])
  if(nrow(res) == 0){
    res = 0
  } else {
    res = res %>% 
      mutate(area = st_area(geometry)) %>% 
      pull(area)
  }
  return(res)
}

# perform the intersection - needs to be pairwise, hence the func above
# t1$area_intersection <- map_dbl(1:nrow(t1), ~f_intersection(.x)) # slow

# calculate modified jacard
# t1$jacard <- as.numeric(t1$area_intersection / t1$area_t1)

# load pre-computed result from above b/c the previous 2 lines take time
# write_rds(t1, here("staging/t1_jacard.rds"))
t1 <- read_rds(here("staging/t1_jacard.rds"))

# sanity check that intersection worked
# j = 6
# mapview(t1[j,]) + 
#   mapview(t2[j,], col.regions = "green") + 
#   mapview(st_intersection(t2[j, ], t1[j, ]), col.regions = "red")
# t1$jacard[j]

We calculate that 62.94% of Tier 2 matched TIGER Place boundaries spatially intersect their assigned explicit Tier 1 boundary (when present). The distribution of overlap (Modified Jaccard) for these 62.94% of intersecting Tier 1 and 2 geometries is generally left-skewed, which indicates high overlap. A Jaccard of 1 indicates complete overlap (i.e., the TIGER Place fully covers the explicit water system boundary), a Jaccard of 0.1 indicates that only 10% of the Tier 1 explicit boundary is covered by the Tier 2 boundary, and a Jaccard of 0 indicates zero overlap (i.e., non-intersection).

t1 %>% 
  filter(jacard != 0) %>% 
  ggplot(aes(jacard)) +
  geom_histogram(color = "white") +
  labs(x = "Modified Jaccard Similarity") +
  theme_report

36.97% of Tier 1 and 2 boundaries do not intersect; however, we show that the vast majority of these geometries are close to one another, which is unsurprising, as matches are constrained within state. The distribution of distances between centroids of non-intersecting matches is shown below.

# a substantial portion (about a third) of matched tier 2 geoms don't intersect
# tier 1 explicit boundaries but are they close by? calculate centroid distances
t1_zero <- t1 %>% 
  filter(jacard == 0) %>% 
  mutate(geometry = st_centroid(geometry)) %>% 
  arrange(pwsid)

t2_zero <- t2 %>% 
  filter(pwsid %in% t1_zero$pwsid) %>% 
  mutate(geometry = st_centroid(geometry)) %>% 
  arrange(pwsid)

# sanity check: t1 and t2 vectors are identical and ready to be compared
# identical(t1_zero$pwsid, t2_zero$pwsid)

# calculate pairwise distances between Tier 1-2 non-intersecting centroids
# with "zero jacard" (zd)
f_distance <- function(x){
  res = st_distance(t1_zero[x, ], t2_zero[x, ])
  return(as.numeric(res)/1000)
}
# zjd <- map_dbl(1:nrow(t1_zero), ~f_distance(.x)) # slow

# load pre-computed result from above because it takes a while to run
# write_rds(zjd, here("staging/zero_dist_non_intersects.rds"))
zjd <- read_rds(here("staging/zero_dist_non_intersects.rds"))

# histogram of distances in km
tibble(dist = zjd) %>% 
  ggplot(aes(dist)) +
  geom_histogram(color = "white", bins = 200) +
  labs(x = "Distance between matched Tier 2 and Tier 1 polygon centroids (km)") +
  coord_cartesian(xlim = c(0, 500)) + 
  theme_report 

The median distance between non-intersecting matched Tier 1 and 2 geometries is 20.35 km and the interquartile range is 11.43 - 33.46 km. This suggests that non-spatially-intersecting matches are still relatively close together and hence may be considered adequate preliminary water service boundaries. However, if spatial accuracy finer than a few kilometers is paramount, Tier 3 boundaries may be preferred over Tier 2 boundaries, specifically in cases where the Tier 2 boundary does not spatially intersect the provided ECHO centroid.


Tier 3: Modeled boundaries

In the absence of explicit spatial boundaries (Tier 1) and matched TIGER proxy boundaries (Tier 2), we statistically model an estimated boundary (Tier 3). Model specification hinges on the correlation between the radius of Tier 1 convex hulls3 (the response variable) and predictors that explain this response, such as service connection count, population served, ownership type, and so on.

We experimented with 3 different models: random forest, xgboost, and multiple linear regression. These different statistical and machine learning models have different assumptions and performance, which are discussed in the sections that follow.

Ultimately, we selected the multiple linear regression model as our final model because it is computationally efficient, easily interpretable, provides confidence intervals to characterize uncertainty in the modeled boundary (this may be useful depending on the application of the model results), and finally because it avoids overfitting4.


EDA

In this section of the report, we expand on important data pre-processing steps, feature selection/engineering, and the models we developed.


Right skewed response

The response variable (radius) is right-skewed. The median radius is 0.86 km, and the maximum radius is 53.19 km. Linear models assume constant error variance and normality, thus we log-transform the response to stabilize its variance, enable regression and inference approaches, prevent high-leverage radii from exerting undue influence on the model fit, and disallow negative predicted radii.

d %>% 
  ggplot(aes(radius)) + 
  geom_histogram(bins = 100, col = "white") +
  labs(x = "Radius (km)") +
  theme_report

The main drawback of log-transformation is that model coefficient units (e.g., the slope of the regression fit) and measures of performance (e.g., RMSE) become difficult to interpret; however, there are methods to handle this interpretation5. The benefits of log-transformation outweigh the drawbacks, thus we log-transform the response variable.
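For example, in a log-log fit the slope reads as an elasticity: a coefficient of \(b\) implies roughly a \(b\)% change in radius per 1% change in service connections, and predictions are back-transformed with `exp()` (optionally with a smearing correction). A minimal sketch on simulated data (all values below are made up for illustration):

```r
set.seed(1)

# Simulated log-log relationship: radius grows as a power of connections
conns  <- exp(runif(500, log(20), log(1e5)))
radius <- 0.05 * conns^0.5 * exp(rnorm(500, sd = 0.3))

fit <- lm(log(radius) ~ log(conns))

# Slope ~ 0.5: a 1% increase in connections -> ~0.5% increase in radius.
# Back-transform predictions with exp(); a smearing correction can be
# added if unbiased predictions on the original scale are required.
coef(fit)[2]
exp(predict(fit, newdata = data.frame(conns = 1000)))
```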

Moreover, to avoid over-fitting a model on a training set that over-represents the more common low-radius observations, we perform a stratified random sample on the response variable. This samples equal proportions of training observations from each quartile of the radius distribution, delineated by red dashed lines in the figure below.

d %>% 
  # calculate quantiles to show stratified random sampling
  mutate(
    p25 = quantile(radius, 0.25, na.rm = TRUE),
    p50 = quantile(radius, 0.50, na.rm = TRUE),
    p75 = quantile(radius, 0.75, na.rm = TRUE)
  ) %>% 
  ggplot(aes(radius)) + 
  geom_histogram(bins = 100, col= "white") + 
  geom_vline(aes(xintercept = p25), linetype = "dashed", color = "red") +
  geom_vline(aes(xintercept = p50), linetype = "dashed", color = "red") +
  geom_vline(aes(xintercept = p75), linetype = "dashed", color = "red") +
  scale_x_log10() +
  labs(x = "Radius (km)") +
  theme_report


Correlated predictors

Population served and service connections are highly correlated (\(r\) \(=\) 0.84), even across levels of other variables like owner type code. Both also exhibit log-normality; thus, like the response variable, they are log-transformed, which further affects the interpretation of the regression fit6. Moreover, these two correlated predictors should not be used together in a linear model (to avoid multicollinearity, which reduces the precision of estimated model coefficients), so we use only one, and try a model that multiplies both predictors to see if combining them improves the fit (albeit at the loss of some interpretability). We can, however, use both variables in tree-based methods.

d %>% 
  ggplot(aes(population_served_count, service_connections_count)) + 
  geom_point(aes(color = owner_type_code), alpha = 0.2) +
  rcartocolor::scale_color_carto_d() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Population served", 
       y = "Service connections",
       color = "Owner Type Code") +
  theme_report


Interaction effects


Owner type code

Regression slopes for service connections differ by owner type code. This suggests an interaction term between service connections (and, by correlation, population) and owner type. Most systems are owned by local governments (L) or private (P) entities. Moreover, Native American (N) systems are poorly represented – this should be addressed in future work7.

d %>% 
  ggplot(aes(service_connections_count, radius)) + 
  geom_point(alpha = 0.2) + 
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~owner_type_code) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Service Connections", 
       y = "Radius (km)") +
  theme_report


Wholesaler

As before with owner type code, wholesaler status interacts with service connections, thus we include an interaction effect for it in the linear model.

d %>% 
  ggplot(aes(service_connections_count, radius)) + 
  geom_point(alpha = 0.2) + 
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~is_wholesaler_ind) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Service Connections", 
       y = "Radius (km)") +
  theme_report


Service area type code

No observations are missing the service area type feature. At first glance, there are 1178 unique service area type codes, with many systems showing a combination of more than one type. However, upon closer inspection, service area type code combinations are often permutations of one another (e.g., “[‘RA’, ‘SC’]” and “[‘SC’, ‘RA’]”). Thus, we clean them to find truly unique service area type code combinations, select the top 5 as features for the model, lump everything else into an “Other” category, and train on this feature. We include an interaction effect for service area type code.

# clean service area type code column
d <- d %>% 
  mutate(
    # split type codes in the "python list" into chr vectors
    satc = strsplit(service_area_type_code, ", "),
    # map over the list to remove brackets ([]) and quotes (')
    satc = map(satc, ~str_remove_all(.x, "\\[|\\]|'")),
    # sort the resulting chr vector
    satc = map(satc, ~sort(.x)), 
    # collapse the sorted chr vector
    satc = map_chr(satc, ~paste(.x, collapse = ""))
  ) 

After cleaning, there are 699 truly unique service area type code combinations, and most are “RA” or “MH”. Regression slopes appear similar, but with different y-intercepts, which suggests that some service area types are larger for an equal service connection count. We may consider cleaning this further in a future iteration if domain knowledge indicates predictive power, although preliminary visual inspection (below) suggests little explanatory power.

# plot population v radius and facet by top 5 service area type code
d %>% 
  mutate(
    satc = fct_lump_n(satc, 5), 
    satc = as.character(satc), 
    satc = ifelse(is.na(satc), "Other", satc)
  ) %>%
  ggplot(aes(service_connections_count, radius)) + 
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) + 
  facet_wrap(~satc) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Service connections", y = "Radius (km)") +
  theme_report


Models

We fit random forest, xgboost, and multiple linear regression models to a training set of 80% of the data selected by stratified random sampling to avoid overfitting to low-radius observations. We then evaluate model performance on the 20% holdout test set, discuss model uncertainty and limitations, interpret model results, and use the model to generate Tier 3 predictions on unlabeled data not covered by Tier 1 or 2 boundaries.


Random Forest

We first fit a random forest to the Tier 1 labeled data because the algorithm has less strict assumptions compared to regression, it can handle correlated predictors, and because the variable importance output quickly indicates which variables explain the most variance in the dataset. This can confirm our understanding of the relationships in the data explored in the EDA above and inform how to specify subsequent, more interpretable linear models.

We fit the random forest regression to predict radius from service connections, population, owner type, service area type code, and wholesaler status. Service connections, population served, and owner type code appear important in segmenting the feature space, while service area type code and wholesaler status appear less predictive.

library(tidymodels)
library(vip)

# read full dataset 
d <- read_csv(path(staging_path, "matched_output_clean.csv")) 

# unlabeled data (du) and labeled data (dl)
du <- d %>% filter(is.na(radius))
dl <- d %>% filter(!is.na(radius))

# split labeled data (dl) into train and test with stratified random sampling
# in each of the radius quartiles to account for the lognormal distribution
# of the response variable (radius) and avoid overfitting to small radius obs
set.seed(55)
dl_split <- initial_split(dl, prop = 0.8, strata = radius)
train    <- training(dl_split) 
test     <- testing(dl_split)

# model and workflow
rf_mod <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

rf_wflow <- 
  workflow() %>% 
  add_formula(
    radius ~ 
      population_served_count + 
      # importantly, the RF can have correlated predictors, so we add
      # service connections, and don't need to account for interactions
      service_connections_count + 
      owner_type_code + 
      is_wholesaler_ind + 
      satc
  ) %>% 
  add_model(rf_mod) 

# fit the random forest model
rf_fit <- fit(rf_wflow, train)

# show variable importance
rf_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "point") +
  theme_report

Next we evaluate the model on the test set, plot residuals, and calculate an \(R^2\) goodness-of-fit metric and two standard error metrics, RMSE and MAE. For our purposes, MAE is likely more important because it does not penalize outliers as heavily as RMSE, and we care more about the overall appropriateness of the boundary than about preventing large outlying errors.
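The contrast between the two error metrics is easy to see on a toy residual vector (the numbers below are illustrative only): a single large outlier inflates RMSE far more than MAE.

```r
# Residuals with a single large outlier
err <- c(0.1, -0.2, 0.1, 0.0, 3.0)

rmse <- sqrt(mean(err^2))  # quadratic penalty: dominated by the outlier
mae  <- mean(abs(err))     # linear penalty: grows only linearly with it

c(rmse = rmse, mae = mae)
```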

# predict on test set
rf_test_res <- test %>% 
  # select(radius) %>% 
  bind_cols(predict(rf_fit, test)) 

# plot residuals
rf_test_res %>% 
  ggplot(aes(log10(radius), log10(.pred), color = owner_type_code)) + 
  geom_point(alpha = 0.4) + 
  geom_abline(lty = 2, color = "red") + 
  labs(y = "Predicted radius (log10)", x = "Radius (log10)",
       color = "Owner Type") +
  # scale and size the x- and y-axis uniformly
  coord_obs_pred() +
  theme_report

# calculate goodness of fit and error metrics
rf_metrics <- metric_set(rsq, rmse, mae)
rf_metrics(rf_test_res, truth = log10(radius), estimate = log10(.pred)) %>% 
  select(-.estimator) %>% 
  knitr::kable(col.names = c("Metric", "Estimate")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Metric Estimate
rsq 0.6571340
rmse 0.3857154
mae 0.3092298


xgboost

Like random forest, xgboost is a tree-based method that is relatively easy to fit and unconstrained by linear model assumptions. Variable importance metrics were consistent with those examined in the random forest model. We tuned xgboost hyperparameters across a grid of inputs (see plot below, used to select the best xgboost model) and found that – on the holdout test set – it performed worse than the random forest in terms of goodness of fit and error. Though surprising at first, this result makes sense given the flexibility of the xgboost model, the lack of the larger datasets that more flexible models benefit from, and the underlying functional form of the relationship we aim to model.

# load hyperparameter tune space
xgb_res <- read_rds(here("src/analysis/sandbox/model_explore/etc/xgb_res.rds"))

# visualize model performance across tuning grid
xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  select(mean, min_n:learn_rate) %>%
  pivot_longer(min_n:learn_rate,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rsq") +
  theme_report

xgboost is a flexible machine learning model: it tends to outperform random forest when there is sufficient data to segment the feature space and learn from residuals. However, we have only a few tens of thousands of water systems and only 3 high-importance predictors, 2 of which are highly correlated. Thus, the similar performance of xgboost and random forest may be explained by xgboost being unable to improve on random forest while slightly overfitting to the training set, a problem a less flexible model like linear regression will not suffer from. In statistical terms, low-flexibility models also tend to maintain low variance, meaning they perform similarly across new test sets. Finally, because the underlying functional form of the relationship being modeled is strongly linear (see the section on interaction effects), there is little advantage in straying from linear methods.

# load the best fit model
final_xgb <- read_rds(here("src/analysis/sandbox/model_explore/etc/final_xgb.rds"))

# fit the final xgboost model on training data
xgb_fit <- fit(final_xgb, train)

# show variable importance
xgb_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "point") +
  theme_report

# predict on test set
xgb_test_res <- test %>% 
  select(radius) %>% 
  bind_cols(predict(xgb_fit, test)) 

# plot residuals
xgb_test_res %>% 
  ggplot(aes(log10(radius), log10(.pred))) + 
  geom_point(alpha = 0.4) + 
  geom_abline(lty = 2, color = "red") + 
  labs(y = "Predicted radius (log10)", x = "Radius (log10)") +
  # scale and size the x- and y-axis uniformly
  coord_obs_pred() +
  theme_report

# calculate goodness of fit and error metrics on the log10 scale
xgb_metrics <- metric_set(rsq, rmse, mae)
xgb_test_res %>% 
  mutate(across(c(radius, .pred), log10)) %>% 
  xgb_metrics(truth = radius, estimate = .pred) %>% 
  select(-.estimator) %>% 
  knitr::kable(col.names = c("Metric", "Estimate")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Metric Estimate
rsq 0.6511226
rmse 0.3891672
mae 0.3126558


Linear Regression

As explained in the review of random forest and xgboost models above, there is sound rationale for linear regression in the context of this problem. A strong (and intuitive) linear relationship is observed between Tier 1 water system radii and service connections. The linear model fit outperforms other models, is easily interpretable, and provides standard error metrics. The underlying data do not have a high number of observations that machine learning models would benefit from, and the parameter space is also not high in dimensionality. For these reasons, we select the linear model to estimate Tier 3 boundaries.

Unlike the random forest and xgboost models, we are careful to log-transform the log-normal predictors and response, and to specify interaction terms where present. We experimented with different model specifications, including a model that combined the correlated population and service connection variables; however, this led to negligible improvement in the model fit and less interpretable model coefficients, so in the final model specification we regress only on service connections. Interaction terms are added for owner type, service area type code, and wholesaler status. A simple linear regression on service connections alone has an \(R^2\) of 0.56; including these extra terms substantially improves the model fit and reduces test error.
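Schematically, this specification fits a log-log model of the form

\[
\log_{10}(r) = \beta_0 + \beta_1 \log_{10}(c) + \sum_j \gamma_j z_j + \sum_j \delta_j \, z_j \log_{10}(c) + \varepsilon,
\]

where \(r\) is the system radius, \(c\) is the service connection count, and the \(z_j\) are dummy-encoded owner type, service area type code, and wholesaler indicators (the \(\delta_j\) carry the interaction effects).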

# re-read dataset and log transform the response - only for linear model
d <- read_csv(path(staging_path, "matched_output_clean.csv")) %>% 
  mutate(radius  = log10(radius),
         # multiply correlated predictors - however, this has negligible 
         # impact on model fit and makes it harder to interpret, so ignore
         density = population_served_count * service_connections_count)

# unlabeled data (du) and labeled data (dl)
du <- d %>% filter(is.na(radius))
dl <- d %>% filter(!is.na(radius))

# split labeled data (dl) into train and test with stratified random sampling
# in each of the radius quartiles to account for the lognormal distribution
# of the response variable (radius) and avoid overfitting to small radius obs
set.seed(55)
dl_split <- initial_split(dl, prop = 0.8, strata = radius)
train    <- training(dl_split) 
test     <- testing(dl_split)

# lm recipe
lm_recipe <- 
  # specify the model - interaction terms come later
  recipe(
    radius ~ 
      service_connections_count + 
      owner_type_code +
      satc +
      is_wholesaler_ind,
    data = train
  ) %>% 
  # convert predictors to log10
  step_log(service_connections_count, base = 10) %>% 
  # encode categorical variables  
  step_dummy(all_nominal_predictors()) %>% 
  # specify interaction effects
  step_interact(~service_connections_count:starts_with("owner_type_code")) %>%
  step_interact(~service_connections_count:starts_with("satc")) %>%
  step_interact(~service_connections_count:starts_with("is_wholesaler_ind"))

# specify model and engine for the linear model
lm_mod <- linear_reg() %>% set_engine("lm")

# lm workflow
lm_wflow <- 
  workflow() %>% 
  add_model(lm_mod) %>% 
  add_recipe(lm_recipe)

# fit the linear model on the training set
lm_fit <- fit(lm_wflow, train)

# predict on the test set and bind mean predictions and CIs
lm_test_res <- test %>% 
  select(radius) %>% 
  bind_cols(predict(lm_fit, test)) %>% 
  bind_cols(predict(lm_fit, test, type = "conf_int"))

# plot residuals
lm_test_res %>% 
  ggplot(aes(radius, .pred)) + 
  geom_point(alpha = 0.4) + 
  geom_abline(lty = 2, color = "red") + 
  labs(y = "Predicted radius (log10)", x = "Radius (log10)") +
  # scale and size the x- and y-axis uniformly
  coord_obs_pred() +
  theme_report

On the test set, linear model goodness of fit is comparable to random forest, but error metrics are notably lower.

# calculate goodness of fit and error metrics
lm_metrics <- metric_set(rsq, rmse, mae)
lm_metrics(lm_test_res, truth = radius, estimate = .pred) %>% 
  select(-.estimator) %>% 
  knitr::kable(col.names = c("Metric", "Estimate")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Metric Estimate
rsq 0.6562666
rmse 0.3542352
mae 0.2709955


Spatial residuals

We examine model residuals across space (i.e., Tier 3 predictions minus Tier 1 explicit labels) to test for spatial autocorrelation in the model residuals. Although we do not calculate standard autocorrelation metrics, a high-level examination of the residuals across states shows that Tier 3 estimates are generally consistent with observed radii (most residuals are near 0 km, indicating strong agreement and consistent with the 1:1 predicted vs. observed plot above). Notable exceptions include MO and OK, which show systematic under-estimation (negative values indicate a smaller estimated radius than observed) on the order of 5 km (3.1 miles).

# Tier 3: Modeled boundaries: read and construct geometry
t3 <- path(staging_path, "tier3_median.geojson") %>% st_read(quiet = TRUE)

# examine residuals
t3 <- t3 %>% filter(!is.na(radius)) %>% mutate(r = .pred - radius)

# plot
t3 %>% 
  mutate(state = str_sub(pwsid, 1, 2)) %>% 
  ggplot() +
  geom_vline(xintercept = 0, color = "black", linetype = "dashed") +
  geom_line(aes(r/1e3, color = state), stat = "density") + 
  rcartocolor::scale_color_carto_d() +
  facet_wrap(~state, scales = "free_y") +
  guides(color = "none") +
  labs(x = "Residual (km)", y = "Density") +
  theme_report


Confidence intervals

Linear model 5% and 95% confidence intervals are calculated and provided as an output. Importantly, the range of these predictions is relatively narrow, generally less than 1 km.
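Because the model is fit on \(\log_{10}\)-transformed radii, point predictions and interval bounds are back-transformed before use; note that the back-transformed point prediction corresponds to the median (not the mean) radius:

\[
\hat{r} = 10^{\hat{y}}, \qquad \big[\,10^{\hat{y}_{lower}},\; 10^{\hat{y}_{upper}}\,\big].
\]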

d %>% 
  select(pwsid, radius) %>% 
  bind_cols(predict(lm_fit, d)) %>% 
  bind_cols(predict(lm_fit, d, type = "conf_int", level = 0.95)) %>% 
  mutate(
    # exponentiate results back to median (unbiased), and 5/95 CIs
    across(where(is.numeric), ~10^(.x)),
    # calculate CI range to plot distribution
    .pred_range = .pred_upper - .pred_lower
  ) %>% 
  ggplot(aes(.pred_range/1000)) +
  geom_histogram(bins = 500, color = "white") +
  coord_cartesian(xlim = c(0, 2)) +
  labs(x = "CI range (km)") +
  theme_report

Below is an example of one Tier 3 system with a typical CI range (around 1 km). Toggle to the Esri.WorldImagery basemap to view satellite imagery of the underlying location and compare it to the spread of the Tier 3 estimates.

# lm CI layers
t3m_med <- st_read(path(staging_path, "tier3_median.geojson"), quiet = TRUE)
t3m_ciu <- st_read(path(staging_path, "tier3_ci_upper_95.geojson"), quiet = TRUE)
t3m_cil <- st_read(path(staging_path, "tier3_ci_lower_05.geojson"), quiet = TRUE)

# plot examples
pwsid_select <- "IN5289012" # ~1km CI range
mapview(filter(t3m_ciu, pwsid == pwsid_select), 
        col.regions = "green", layer.name = "upper CI", 
        alpha.regions = 0.3) +
  mapview(filter(t3m_med, pwsid == pwsid_select), 
          col.regions = "blue", layer.name = "median",
          alpha.regions = 0.3) +
  mapview(filter(t3m_cil, pwsid == pwsid_select), 
          col.regions = "red", layer.name = "lower CI",
          alpha.regions = 0.3) 


Model Coefficients

In the linear model specified above, when both the dependent (response) variable and the independent (predictor) variables are log-transformed, we interpret a coefficient as the percent change in the response for every 1% increase in that predictor. In our linear model, both the response (radius) and the service connection count are log-transformed, so we interpret the service_connections_count coefficient (0.36) as follows: for every 1% increase in service connection count, we expect a 0.36% increase in water system radius. Thus, 25%, 50%, and 75% increases in service connection count are associated with approximately 8%, 16%, and 22% increases in radius, respectively.
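The percentages above follow from the power-law relationship implied by a log-log fit; a quick sketch (in Python for illustration, though the pipeline itself is in R) using the service connection coefficient from the model table below:

```python
# Elasticity interpretation of a log-log regression coefficient.
# With log10(radius) = b0 + b * log10(connections) + ..., radius scales as
# connections**b, so percent changes compose multiplicatively.
b = 0.3583765  # service_connections_count coefficient from the fitted model


def radius_pct_change(connections_pct_increase: float, b: float = b) -> float:
    """Percent change in radius for a given percent increase in connections."""
    return ((1 + connections_pct_increase / 100) ** b - 1) * 100


for pct in (25, 50, 75):
    # prints roughly +8.3%, +15.6%, and +22.2% respectively
    print(f"+{pct}% connections -> +{radius_pct_change(pct):.1f}% radius")
```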

It is also worth noting the high prevalence of coefficients with high p-values, which suggests that there is little evidence of an effect for those terms8. Most interaction term coefficients fall into this category. By contrast, coefficients with low p-values indicate predictors whose changes are associated with changes in the response variable.

lm_fit %>% 
  extract_fit_engine() %>% 
  tidy() %>% 
  knitr::kable() %>% 
  kableExtra::kable_styling(full_width = FALSE)  
term estimate std.error statistic p.value
(Intercept) 1.8406891 0.1573912 11.6949970 0.0000000
service_connections_count 0.3583765 0.0644550 5.5601016 0.0000000
owner_type_code_L 0.0790793 0.1467316 0.5389384 0.5899393
owner_type_code_M -0.3394364 0.1641290 -2.0681067 0.0386512
owner_type_code_P -0.4226456 0.1463493 -2.8879231 0.0038848
owner_type_code_S 0.2534882 0.1766064 1.4353284 0.1512194
satc_MP 0.0964961 0.3167035 0.3046890 0.7606083
satc_MU 0.4275916 0.1543450 2.7703622 0.0056080
satc_Other 0.0384552 0.0641572 0.5993893 0.5489246
satc_RA 0.2830756 0.0602426 4.6989296 0.0000026
satc_SU 0.5804612 0.3072469 1.8892337 0.0588845
is_wholesaler_ind_wholesaler 0.3421842 0.0472483 7.2422596 0.0000000
service_connections_count_x_owner_type_code_L -0.0405382 0.0551911 -0.7345051 0.4626552
service_connections_count_x_owner_type_code_M 0.0038149 0.0663262 0.0575168 0.9541345
service_connections_count_x_owner_type_code_P 0.0652470 0.0553626 1.1785384 0.2386053
service_connections_count_x_owner_type_code_S -0.2148026 0.0688746 -3.1187494 0.0018205
service_connections_count_x_satc_MP -0.0261612 0.1776783 -0.1472392 0.8829457
service_connections_count_x_satc_MU 0.0187448 0.0562693 0.3331261 0.7390449
service_connections_count_x_satc_Other 0.1268602 0.0350216 3.6223476 0.0002931
service_connections_count_x_satc_RA 0.0677498 0.0339332 1.9965608 0.0458953
service_connections_count_x_satc_SU -0.0296779 0.1495852 -0.1984012 0.8427345
service_connections_count_x_is_wholesaler_ind_wholesaler -0.1093013 0.0152465 -7.1689610 0.0000000


Limitations

No model is without limitations. In this section, we highlight key limitations and identify potential future pathways to improve the model fit through additional data collection and/or cleaning.


The “Pancake Problem”

# filter for pancake stacks
pancakes <- d %>% 
  filter(geometry_quality == "County Centroid" & has_labeled_bound == FALSE & 
         is.na(tiger_match_geoid)) 

# counties of big pancakes
big_pancakes <- pancakes %>% 
  count(county_served, state_code, sort = TRUE) 

# plot
big_pancake_pwsid <- pancakes %>% 
  filter(county_served == big_pancakes$county_served[1]) %>% 
  pull(pwsid)

The “pancake problem” is fundamentally caused by low centroid accuracy (e.g., county or state centroid accuracy), and impacts PWSIDs that lack an explicit boundary (or TIGER match). Estimated radii for these systems result in circular buffers that overlap, like a stack of pancakes. To illustrate, we query for the biggest pancake problem (the greatest number of pancakes in a single stack), which belongs to DUTCHESS county, NY, and contains 119 Tier 3 predictions.

# visualize
temm %>% 
  filter(pwsid %in% big_pancake_pwsid) %>% 
  mapview(zcol = "pwsid")

Upstream centroid accuracy issues are beyond the scope of this study, and may be addressed in coordination with federal agencies that produce FRS and ECHO databases. A recommendation for how to address the pancake problem is outlined in the Recommendations section below.


Missing Primacy Types

We lack labeled data, and hence radii, for tribal primacy types (and territories, although this analysis excludes them), and cannot improve the model on this dimension until such data are collected. Thus, Tier 3 tribal community water system boundaries are fit by models trained on state primacy types, which may introduce error in modeled boundaries if system boundaries systematically vary with this parameter. This issue is related to missing “owner type codes” for “native” (N) water systems, described above.

d %>% 
  group_by(primacy_type) %>% 
  summarise(proportion_present = sum(!is.na(radius))/n(),
            proportion_present = round(proportion_present, 2)) %>% 
  knitr::kable(col.names = c("Primacy Type", "Proportion Present")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Primacy Type Proportion Present
State 0.32
Territory 0.00
Tribal 0.00


Recommendations

Beyond a discussion of how the TEMM spatial layer may be used (covered in the next section, Contributions), we advise the following improvements to the pipeline for creating the TEMM spatial layer. In order of importance:

  1. Collect additional Tier 1 data. Explicit boundary data supersedes Tier 2 TIGER proxy matches and Tier 3 modeled water system boundaries. Moreover, additional Tier 1 data improves Tier 3 model predictions by providing more training data for similar systems with missing boundary data or a matching TIGER place. In theory, explicit boundary data for each state across the USA would supersede all Tier 2 and 3 matches and estimates and reduce the pipeline presented herein to a simple data cleaning and assimilation task.
  2. Address the “Pancake Problem”. Low-accuracy centroids challenge the usefulness of the TEMM layer and introduce uncertainty into analyses that depend on these data. We propose an approach to re-center low-accuracy centroids:
    • To “spatially dis-aggregate” or “spread out” the pancakes to more reasonable locations, one approach is to relocate low accuracy centroids (i.e., state and county level centroid accuracy) to the centroids of cities and towns that fall within the county/state (depending on the centroid accuracy). Candidate cities for pancakes may be identified via a name match.
  3. Gather labeled boundaries for tribal systems (and territories if coverage is to be extended to these regions). The lack of data in tribal areas means that these areas are likely to be underrepresented. Coordination should target the collection of Tier 1 data for these systems.
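As a rough sketch of the re-centering idea in item 2 (illustrative Python with hypothetical inputs and approximate coordinates; the actual pipeline is in R, and a production version would draw system names from SDWIS and place centroids from TIGER):

```python
from difflib import get_close_matches

# Hypothetical example data: candidate places (cities/towns) inside a county,
# keyed by county, with approximate place-centroid coordinates.
places_by_county = {
    "Dutchess": {"Poughkeepsie": (41.70, -73.92), "Beacon": (41.50, -73.97)},
}


def recenter(system: dict, places_by_county: dict, cutoff: float = 0.6):
    """Relocate a low-accuracy (county-level) centroid to the centroid of the
    best name-matched place within the system's county; None if no match."""
    candidates = places_by_county.get(system["county"], {})
    # fuzzy-match the leading token of the system name against place names
    token = system["name"].split()[0].title()
    match = get_close_matches(token, list(candidates), n=1, cutoff=cutoff)
    return candidates[match[0]] if match else None


print(recenter({"name": "POUGHKEEPSIE WATER DEPT", "county": "Dutchess"},
               places_by_county))  # relocates to the Poughkeepsie centroid
```

Systems with no sufficiently close name match would keep their original centroid (and their pancake-stack position) pending manual review.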


Contributions

We developed a “Tiered Explicit, Match, and Model” (TEMM) approach to compile a nationwide water system boundary layer via an open-source data pipeline. Such a dataset is unprecedented: presently, easily accessible, machine-readable, and clean water system spatial boundary data – where available online at all – are siloed across states. This work brings together multiple spatial datasets where they exist, and develops algorithms and models to “fill in” missing data where they do not.

The spatial layer covers 47,708 community water systems and a total population of 309,531,362. Most people (285 million, 92.19% of the population) are covered by a Tier 1 or 2 spatial boundary, which have relatively high accuracy.

The data pipeline developed in this work can dynamically regenerate the TEMM spatial layer based on the addition of new Tier 1 explicit boundaries, refinement of the Tier 2 matching algorithm, improvements to the Tier 3 modeled boundaries, and changes to any other data dependencies (e.g., improved centroid locations to solve the pancake problem). This pipeline is provided under the MIT License at an open-source GitHub repository, along with the output TEMM spatial layer in geojson, shapefile, and csv formats. Contributor guidelines are available on GitHub and detail the processes by which contributions may be made to the project.

We imagine that the TEMM spatial layer is appropriate as an input for any analyses that depend on nationwide water service coverage boundaries.


Footnotes


  1. In a previous report (eda_february.html,“Section 6 TIGER proxy polygon appropriateness”), we show that matching PWSID to TIGER places is highly accurate: matched TIGER place polygons intersected explicit water service boundary polygons 93.16% of the time. In this report, we add more states, modify the matching algorithm, and further quantify this boolean (TRUE/FALSE) spatial intersection in terms of “percent overlap”. Thus, this older metric should be disregarded in light of new developments described herein.↩︎

  2. Some water systems were excluded from this data layer. Specifically, 267 water systems were dropped because they had reported service connections or population less than 15, which we categorize as unreasonable given that community water systems are defined as serving at least 15 connections. We also assume that each connection must serve at least one person. Next, we removed 834 water systems associated with US territories; most excluded water systems in this step are found in Puerto Rico.↩︎

  3. Labeled radii are calculated from the convex hull area of Tier 1 systems instead of simple water system area because they better represent calculated Tier 3 circular buffers.↩︎

  4. “Overfitting” is the tendency of a statistical or machine learning model to fit well to “seen” training data but generalize poorly to “unseen” testing data. We need the model to generalize well to unseen water systems. The linear model actually results in the lowest error of the models tested, which may at first seem surprising, but is not, considering the strong linear relationship between water system radius and service connections. One might expect machine learning approaches like random forest and xgboost to outperform the linear model. In fact, they do on training sets, but they tend to overfit the training set and not generalize as well to test sets. Simply put, we do not have enough of the high-dimensional data that machine learning models require to outperform classical approaches like linear regression in the context of this problem.↩︎

  5. When only the dependent/response variable is log-transformed: exponentiate the coefficient, subtract one from this number, and multiply by 100. This gives the percent increase (or decrease) in the response for every one-unit increase in the independent variable. Example: the coefficient is 0.198. (exp(0.198) – 1) * 100 = 21.9. For every one-unit increase in the independent variable, our dependent variable increases by about 22%.↩︎

  6. Both dependent/response variable and independent/predictor variable(s) are log-transformed: interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For an x percent increase, raise (1 + x/100) to the power of the coefficient, subtract 1, and multiply by 100. Example: for every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 – 1) * 100 = 3.7 percent.↩︎

  7. For modeling purposes and until more data are collected, we re-classify Native American observations (\(n\) \(=\) 2) into the Public/Private category (M).↩︎

  8. The p-value in a linear regression tests the null hypothesis that the coefficient has no effect (i.e., coefficient = zero). A low p-value indicates evidence for rejecting the null hypothesis and a high p-value indicates otherwise.↩︎